DT-1342: Increase (revert back) TDR tools pool size from 1000 to 2000 #418

pshapiro4broad · 2025-03-11T15:09:51Z

Increase the datarepo tools pool size to prevent tests from blocking or failing unnecessarily.

This value was changed from 1500 to 2000: #293
And then was changed from 2000 to 1000: #374

The usage of this pool depends heavily on how much work is being done on TDR. If there are more than three developers creating one or two branches each, a pool size of 1000 can be exhausted due to concurrent test execution. In addition to more activity, recent changes to allow integration tests to run in parallel, and to avoid needing to "lock" an integration host to run tests on, may have put a greater burden on this resource.

Here's the pool usage graph of the last 7 days

The exhaustion spike occurred yesterday, and the capacity dipped down to 15% last week as well, so this usage doesn't seem atypical. Grafana's data only goes back 10 days, so there's no easy way to see long-term trends.

sonarqubecloud · 2025-03-11T15:10:28Z

Quality Gate passed

Issues
0 New issues
0 Accepted issues

Measures
0 Security Hotspots
0.0% Coverage on New Code
0.0% Duplication on New Code

See analysis details on SonarQube Cloud

fboulnois

👍

davidangb

👍 as long as Justin is ok with it

jyang-broad · 2025-03-17T16:39:16Z

I'm not necessarily against this if it's breaking, but I do want to pause a moment to diagnose what's causing this. According to Phil's screenshot as well as one I took below:

It looks like there's a noted drop in buffer availability at 3pm ET Mondays followed by a lesser one on Tuesday. Is there some sort of scheduled testing that happens weekly?

If this is totally normal activity, then that's fine, but it's a little notable that this seems to look like a strong spike in testing that extends past 5pm.

pshapiro4broad · 2025-03-17T17:32:49Z

> It looks like there's a noted drop in buffer availability at 3pm ET Mondays followed by a lesser one on Tuesday. Is there some sort of scheduled testing that happens weekly?
>
> If this is totally normal activity, then that's fine, but it's a little notable that this seems to look like a strong spike in testing that extends past 5pm.

I think what you're seeing is an artifact of how developers typically work. Each usage spike correlates directly with tests being run. It's likely that many developers are pushing changes before they sign off for the day, which causes the spike to extend after 5pm. When the tests run normally, they take about an hour. With retries they can take longer; the worst case is when the RBS pool is exhausted and the tests wait for resources to become ready, up to the 90-minute test timeout.

You can see the history of the tests here https://github.com/DataBiosphere/jade-data-repo/actions/workflows/int-and-connected-test-run.yml

And the nightly test configuration is

  schedule:
    - cron: '0 4 * * *' # run at 4 AM UTC (11 PM EST / 12 AM EDT)

TL;DR as far as I can tell this is normal (expected) activity for this pool usage.
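
For reference, GitHub Actions evaluates `schedule` cron expressions in UTC, so `0 4 * * *` fires at 04:00 UTC. Here is a minimal sketch for converting that to US Eastern time. It uses fixed offsets (EST = UTC-5, EDT = UTC-4) rather than a timezone database, the date is arbitrary, and the helper name `cron_hour_in_zone` is hypothetical:

```python
from datetime import datetime, timedelta, timezone

# Fixed-offset zones (assumption: no tz database lookup, offsets hard-coded).
EST = timezone(timedelta(hours=-5), "EST")
EDT = timezone(timedelta(hours=-4), "EDT")


def cron_hour_in_zone(utc_hour: int, zone: timezone) -> str:
    """Format the local wall-clock time for a cron job that fires at utc_hour:00 UTC."""
    # The date is arbitrary; only the hour matters for a fixed offset.
    fire_time = datetime(2025, 3, 17, utc_hour, 0, tzinfo=timezone.utc)
    return fire_time.astimezone(zone).strftime("%I:%M %p %Z")


print(cron_hour_in_zone(4, EST))  # 11:00 PM EST (previous calendar day)
print(cron_hour_in_zone(4, EDT))  # 12:00 AM EDT
```

Either way the nightly run lands around midnight Eastern, well away from the afternoon dips in the graph.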

jyang-broad

Audited the test runs. It appears that one test run uses roughly 10% of the pool, which is about 100 workspaces (I believe I've heard this number is actually about 150 but is mitigated by workspace recovery during the test). This occurs over roughly 20 minutes; Buffer takes about 1 hour to recover those 100 workspaces.

So about 7-10 test runs over 1-5 hours would bankrupt the pool, which is what I saw.

I expect that bumping this should allow for 14-20 test runs over 5 hours.
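
One way to reproduce these ranges is a back-of-the-envelope calculation, sketched below. It assumes each run costs between 100 and 150 workspaces (the two figures mentioned above) and ignores Buffer's refill entirely, so it is a rough lower bound and not a model of the real pool. The function name `runs_until_exhaustion` is hypothetical:

```python
def runs_until_exhaustion(pool_size: int, workspaces_per_run: int) -> int:
    """Number of test runs needed to use up the pool, ignoring any refill."""
    # Integer ceiling division: a partially served run still exhausts the pool.
    return -(-pool_size // workspaces_per_run)


for pool_size in (1000, 2000):
    low = runs_until_exhaustion(pool_size, 150)   # pessimistic cost per run
    high = runs_until_exhaustion(pool_size, 100)  # observed cost per run
    print(f"pool={pool_size}: exhausted after roughly {low}-{high} runs")

# pool=1000: exhausted after roughly 7-10 runs
# pool=2000: exhausted after roughly 14-20 runs
```

Because refill is left out, runs spread over several hours would stretch these numbers somewhat; the point is only that doubling the pool doubles the headroom.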

Increase (revert back) TDR tools pool size from 1000 to 2000

7802a58

pshapiro4broad requested a review from a team as a code owner March 11, 2025 15:09

pshapiro4broad requested review from davidangb, marctalbott and snf2ye and removed request for a team March 11, 2025 15:09

marctalbott approved these changes Mar 11, 2025

snf2ye approved these changes Mar 11, 2025

fboulnois approved these changes Mar 11, 2025

pshapiro4broad requested a review from jyang-broad March 11, 2025 15:13

davidangb approved these changes Mar 11, 2025

jyang-broad approved these changes Mar 17, 2025

pshapiro4broad merged commit e59e3f6 into master Mar 17, 2025
6 checks passed

pshapiro4broad deleted the ps/dt-1342-increase-datarepo-tools-pool branch March 17, 2025 18:11
